Welcome to SPICE 2023

This code chunk loads all of the packages that we will use in this script, one library(packagename) call per package
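
The chunk itself is not shown in this rendered output. Judging by the functions called later in the script (read_csv, here, clean_names, ggplot, ggcorrplot, ggplotly), it probably looks something like the sketch below. The exact package list is an assumption, and the install check is an extra that is not part of the original script.

```r
# Packages inferred from the functions used later in this script.
# This list is an assumption; the original chunk is not shown.
pkgs <- c("tidyverse", "here", "janitor", "ggcorrplot", "plotly")

# Find packages that are not installed yet, without stopping the script.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Load every package that is available.
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```

Any package reported as not installed can be added with install.packages() before running the script again.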

Here is where we read in our data

penguins_data <- read_csv(here("data/penguins_data/penguins_lter.csv"))
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl  (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

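Clean the column names of the data frame using the clean_names() function from the janitor package. This converts names such as "Culmen Length (mm)" into snake_case names such as culmen_length_mm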

penguins_data <- penguins_data %>% 
  clean_names()

Exploratory Analysis

Let’s begin by exploring some of the columns of our dataset

There is an island column. Let’s see how many different islands there are in the dataset and how many penguins from the study are on each island. We can use the table() function to accomplish this

table(penguins_data$island)
## 
##    Biscoe     Dream Torgersen 
##       168       124        52
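
If shares are easier to read than raw counts, prop.table() turns a table of counts into proportions. The sketch below rebuilds the island column from the counts printed above so that it runs on its own; in the script itself you could pass table(penguins_data$island) straight to prop.table().

```r
# Rebuild the island column from the counts shown in the table() output above.
island <- rep(c("Biscoe", "Dream", "Torgersen"), times = c(168, 124, 52))

# Count penguins per island, then convert the counts to percentages.
counts <- table(island)
shares <- round(100 * prop.table(counts), 1)
shares
```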

Now, let’s look at the body mass column by creating a histogram

ggplot(data = penguins_data, aes(x = body_mass_g)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
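
The two messages above are not errors. The first says ggplot2 fell back to 30 bins, and the second says two rows had no body mass recorded. Here is a minimal sketch of how both could be handled, using a small made-up data frame rather than the penguin data. The 500 g bin width is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)

# Hypothetical toy body masses in grams (not the Palmer data); two are missing.
toy <- data.frame(
  body_mass_g = c(3200, 3450, 3700, 3750, 4100, 4300, 4650, 5000, 5400, NA, NA)
)

p <- ggplot(data = toy, aes(x = body_mass_g)) +
  # binwidth is given in the units of x, so each bar covers 500 g here.
  # na.rm = TRUE drops the missing values without printing a warning.
  geom_histogram(binwidth = 500, na.rm = TRUE, color = "black")
p
```

Setting binwidth explicitly also makes the plot reproducible if the default number of bins ever changes.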

This looks good. Now let’s clean this up and look at body mass by species of penguin

ggplot(data = penguins_data, aes(x = body_mass_g,
                                 fill = species)) +
  geom_histogram(color = "black") +
  theme_minimal() +
  labs(title = "Penguins, Palmer Station LTER",
       subtitle = "Body Mass Distribution for Adelie, Chinstrap and Gentoo Penguins",
       x = "Body mass (g)",
       y = "Number of Penguins",
       fill = "Penguin species")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

Now let’s look at the relationship between flipper length and body mass with a scatter plot

ggplot(data = penguins_data, aes(x = flipper_length_mm, 
                                 y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

It looks like there is a strong positive correlation between body mass and flipper length. Let’s add three more variables from the dataset: island, species, and sex

ggplot(data = penguins_data, aes(x = flipper_length_mm, 
                                 y = body_mass_g,
                                 color = species,
                                 shape = sex)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin sex") +
  theme(axis.text.x = element_text(angle = 45)) +
  facet_grid(~island)
## Warning: Removed 10 rows containing missing values (`geom_point()`).

Now let’s explore the relationships between our numeric columns in the dataset with a correlation matrix

Select numeric columns for correlations

penguins_data_numeric <- penguins_data %>% 
  select(culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g)

Create a correlation matrix

cor_matrix <- cor(penguins_data_numeric, use = "complete.obs")
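
The use argument of cor() controls what happens to missing values. "complete.obs" drops every row that has an NA in any column, while "pairwise.complete.obs" drops rows separately for each pair of columns. Filtering with complete.cases() first gives the same result as "complete.obs", whichever option is passed afterwards. The sketch below shows the difference on a made-up data frame (hypothetical values, not the penguin data).

```r
# Hypothetical toy data with one missing value in each column.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 1, 4, 3, NA, 6),
  c = c(5, 3, 4, NA, 1, 2)
)

# Listwise deletion: only rows with no NA at all are used (rows 1 to 3 here).
listwise <- cor(toy, use = "complete.obs")

# Filtering to complete rows first leaves nothing for the pairwise option to drop.
filtered_pairwise <- cor(toy[complete.cases(toy), ], use = "pairwise.complete.obs")

# Pairwise deletion: each pair of columns keeps every row where both are present.
pairwise <- cor(toy, use = "pairwise.complete.obs")

all.equal(listwise, filtered_pairwise)  # the two listwise versions agree
listwise["a", "b"]
pairwise["a", "b"]
```

With only two incomplete rows out of 344, as the warnings earlier in this script suggest, the two options should give nearly identical matrices for the penguin data.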

Plot the correlation matrix

corrplot <- ggcorrplot(cor_matrix, type = "lower", outline.color = "white") +
  theme(axis.text.x = element_text(size = 3),
        axis.text.y = element_text(size = 3))
corrplot

Make the correlation plot interactive

ggplotly(corrplot)